This project aims to deepen my understanding of one analysis method and to show a practical application of the knowledge gained during the IODS course.
In the following chapter I explore the relations between the variables in the Hobbies set of the FactoMineR package, using multiple correspondence analysis (MCA).
The goal of this project is to explore how hobbies relate to the variables describing individuals (gender, age, profession and marital status) in the dataset. I am particularly interested in seeing how the following hypotheses relate to my data:
As mentioned above, this project uses data from the FactoMineR package.
One of the suggested datasets for this final assignment was the Hobbies set, which contains an extract of a 2003 “Histoire de vie” questionnaire conducted by the French National Institute of Statistics, l’INSEE. In this part of the study, 8403 individuals aged 15 or more were asked 18 questions about their hobbies. The following 4 variables were used to label the respondents:
The question concerning the hobbies was the following: “Have you done or been involved in the following hobby in the past 12 months, without ever have been obliged to do it?” The dataset included in the FactoMineR package is a data frame with 8403 rows and 23 columns. The rows represent the individuals and the columns represent the different questions. The first 18 columns are the active hobby questions, the following 4 are supplementary categorical variables (describing the respondents) and the 23rd is a supplementary quantitative variable (the number of activities).
My processed data (in .csv format) and the script used to process it can be found under these links.
The data wrangling included the following steps:
The pre-processed dataset contains 6905 observations of 19 variables.
In this part I explore the variables of interest in the Hobbies data, first through a summary and then through bar plots.
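library(dplyr)
# Read the pre-processed data; stringsAsFactors = TRUE keeps the columns as factors
# (since R 4.0.0 read.table() no longer converts strings to factors by default)
hobbies <- read.table("C:\\Users\\E130-WIN7\\Documents\\hobbies.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)
# Drop the row-index column X
hobbies <- dplyr::select(hobbies, -X)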
summary(hobbies)
## Reading Cinema Show Exhibition Computer Sport
## No :2265 No :4135 No :4901 No :4746 No :4296 No :4361
## Yes:4640 Yes:2770 Yes:2004 Yes:2159 Yes:2609 Yes:2544
##
##
##
##
##
## Walking Travelling Collecting Volunteering Mechanic Gardening
## No :3378 No :4098 No :6148 No :5820 No :3868 No :4068
## Yes:3527 Yes:2807 Yes: 757 Yes:1085 Yes:3037 Yes:2837
##
##
##
##
##
## Knitting Cooking Fishing Sex Age Marital.status
## No :5725 No :3829 No :6122 F:3772 (45,55]:1624 Divorcee : 712
## Yes:1180 Yes:3076 Yes: 783 M:3133 (35,45]:1455 Married :3631
## (25,35]:1183 Remarried: 352
## (55,65]:1067 Single :1609
## (65,75]: 713 Widower : 601
## [15,25]: 456
## (Other): 407
## Profession
## Employee :2552
## Foreman : 735
## Management :1052
## Manual labourer :1161
## Other : 212
## Technician : 401
## Unskilled worker: 792
library(tidyr)
library(ggplot2)
hobbies_general<-dplyr::select(hobbies, -Age, -Marital.status, -Profession, -Sex)
gather(hobbies_general) %>% ggplot(aes(value)) + ggtitle("Distribution of hobbies across all respondents") + facet_wrap("key", scales = "free") + geom_bar(fill = "darkolivegreen") + theme(text = element_text(size=15),
axis.text.x = element_text(angle=45, hjust=1))
The plots above show how the hobbies are distributed across all respondents; some are clearly more popular than others.
There are some observations I had not predicted. Collecting (757 of 6905 respondents), fishing (783), volunteering (1085) and knitting (1180) are the least popular hobbies. Reading is by far the most popular: 4640 respondents read, roughly twice as many as the 2265 who do not.
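To make the comparison concrete, the "Yes" counts from the `summary(hobbies)` output can be turned into participation rates. The snippet below is only a small sketch: the counts are copied by hand from the printed summary, and the vector names `yes` and `share` are my own.

```r
# "Yes" counts copied from the summary(hobbies) output
yes <- c(Reading = 4640, Cinema = 2770, Show = 2004, Exhibition = 2159,
         Computer = 2609, Sport = 2544, Walking = 3527, Travelling = 2807,
         Collecting = 757, Volunteering = 1085, Mechanic = 3037,
         Gardening = 2837, Knitting = 1180, Cooking = 3076, Fishing = 783)

# Number of respondents in the pre-processed data
n <- 6905

# Share of respondents (in %) practising each hobby, most popular first
share <- sort(round(100 * yes / n, 1), decreasing = TRUE)
share
```

With the real data frame, the same rates could be obtained directly, for example with `sapply(hobbies_general, function(x) mean(x == "Yes"))`.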
library(tidyr)
library(ggplot2)
individuals<-dplyr::select(hobbies, -Cinema, -Collecting, -Computer, -Cooking, -Exhibition, -Fishing, -Gardening, -Knitting, -Mechanic, -Reading, -Show, -Sport, -Travelling, -Volunteering, -Walking)
gather(individuals) %>% ggplot(aes(value)) + ggtitle("Characteristics of respondents") + facet_wrap("key", scales = "free") + geom_bar(fill = "darkseagreen") + theme(text = element_text(size=15),
axis.text.x = element_text(angle=45, hjust=1))
We can see that the age of the participants is fairly evenly distributed, with more young people than old ones. Most of the questionnaire participants are 25-65 years old, while the smallest group is, unsurprisingly, 86-100. The gender distribution is quite even, with somewhat more females (3772) than males (3133). Most of the respondents are married (3631 of 6905). For profession, the most common category is employee (2552), which means a non-manual type of work.
Multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space.
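To illustrate what "representing data as points in a low-dimensional space" means, here is a minimal base-R sketch of the standard indicator-matrix formulation of MCA on a made-up toy data set. It is not the FactoMineR implementation; the data, the helper `indicator()` and all variable names are assumptions for illustration only.

```r
# Toy data (invented): 6 respondents, two yes/no hobbies and one 3-level variable
toy <- data.frame(
  Reading = c("Yes", "Yes", "No", "No", "Yes", "No"),
  Cinema  = c("Yes", "No", "No", "No", "Yes", "Yes"),
  Age     = c("young", "old", "old", "mid", "young", "mid"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: one 0/1 column per category of a factor
indicator <- function(f, prefix) {
  m <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(m) <- paste(prefix, levels(f), sep = "_")
  m
}

# Indicator matrix Z: rows are individuals, columns are categories
Z <- do.call(cbind, unname(Map(indicator, toy, names(toy))))
Q <- ncol(toy)   # number of variables
J <- ncol(Z)     # number of categories

# Correspondence analysis of Z: standardised residuals, then SVD
P  <- Z / sum(Z)
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
sv <- svd(S)

# At most J - Q dimensions carry variance
keep <- seq_len(J - Q)
eig  <- sv$d[keep]^2                 # eigenvalues (variance per dimension)
pct  <- 100 * eig / sum(eig)         # % of variance per dimension

# Coordinates of the individuals on the retained dimensions
row_coord <- diag(1 / sqrt(r)) %*% sv$u[, keep] %*% diag(sv$d[keep])
round(pct, 1)
```

Each row of `row_coord` is one respondent placed in the reduced space; plotting the first two columns gives the kind of factor map shown later in this chapter.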
library(FactoMineR)
mca <- MCA(hobbies, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = hobbies, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.183 0.115 0.104 0.076 0.065 0.059
## % of var. 10.537 6.595 5.978 4.372 3.721 3.405
## Cumulative % of var. 10.537 17.132 23.110 27.482 31.203 34.608
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.057 0.055 0.055 0.054 0.053 0.052
## % of var. 3.260 3.184 3.169 3.116 3.070 3.012
## Cumulative % of var. 37.868 41.052 44.221 47.337 50.407 53.419
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.052 0.052 0.050 0.050 0.049 0.046
## % of var. 2.994 2.968 2.903 2.881 2.798 2.664
## Cumulative % of var. 56.413 59.382 62.284 65.166 67.963 70.627
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.045 0.043 0.042 0.041 0.038 0.038
## % of var. 2.588 2.469 2.419 2.339 2.196 2.173
## Cumulative % of var. 73.216 75.685 78.104 80.443 82.639 84.812
## Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
## Variance 0.036 0.034 0.033 0.032 0.031 0.030
## % of var. 2.092 1.940 1.927 1.831 1.768 1.735
## Cumulative % of var. 86.904 88.845 90.772 92.603 94.371 96.106
## Dim.31 Dim.32 Dim.33
## Variance 0.029 0.022 0.017
## % of var. 1.658 1.264 0.972
## Cumulative % of var. 97.763 99.028 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.690 0.038 0.276 | 0.187 0.004 0.020 | 0.275
## 2 | -0.001 0.000 0.000 | 0.106 0.001 0.005 | -0.049
## 3 | -0.077 0.000 0.006 | 0.069 0.001 0.005 | 0.003
## 4 | -0.548 0.024 0.282 | -0.445 0.025 0.186 | 0.288
## 5 | -0.173 0.002 0.029 | -0.404 0.021 0.158 | -0.068
## 6 | 0.578 0.026 0.235 | -0.070 0.001 0.003 | -0.422
## 7 | 0.076 0.000 0.005 | -0.009 0.000 0.000 | -0.343
## 8 | 0.062 0.000 0.002 | -0.124 0.002 0.009 | -0.926
## 9 | 0.235 0.004 0.022 | 0.390 0.019 0.061 | -0.057
## 10 | 0.484 0.019 0.197 | 0.147 0.003 0.018 | -0.334
## ctr cos2
## 1 0.011 0.044 |
## 2 0.000 0.001 |
## 3 0.000 0.000 |
## 4 0.012 0.078 |
## 5 0.001 0.004 |
## 6 0.025 0.125 |
## 7 0.016 0.100 |
## 8 0.120 0.512 |
## 9 0.000 0.001 |
## 10 0.016 0.094 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## Reading_No | -0.646 3.937 0.204 -37.501 | -0.474 3.379 0.109
## Reading_Yes | 0.315 1.922 0.204 37.501 | 0.231 1.650 0.109
## Cinema_No | -0.524 4.726 0.410 -53.181 | 0.056 0.087 0.005
## Cinema_Yes | 0.782 7.055 0.410 53.181 | -0.084 0.130 0.005
## Show_No | -0.385 3.027 0.363 -50.035 | -0.075 0.183 0.014
## Show_Yes | 0.942 7.402 0.363 50.035 | 0.183 0.448 0.014
## Exhibition_No | -0.412 3.362 0.374 -50.804 | -0.128 0.515 0.036
## Exhibition_Yes | 0.907 7.390 0.374 50.804 | 0.281 1.132 0.036
## Computer_No | -0.465 3.874 0.357 -49.614 | 0.177 0.891 0.051
## Computer_Yes | 0.766 6.379 0.357 49.614 | -0.291 1.467 0.051
## v.test Dim.3 ctr cos2 v.test
## Reading_No -27.489 | -0.013 0.003 0.000 -0.763 |
## Reading_Yes 27.489 | 0.006 0.001 0.000 0.763 |
## Cinema_No 5.703 | 0.241 1.763 0.087 24.464 |
## Cinema_Yes -5.703 | -0.360 2.632 0.087 -24.464 |
## Show_No -9.742 | 0.040 0.058 0.004 5.218 |
## Show_Yes 9.742 | -0.098 0.142 0.004 -5.218 |
## Exhibition_No -15.732 | -0.083 0.240 0.015 -10.214 |
## Exhibition_Yes 15.732 | 0.182 0.526 0.015 10.214 |
## Computer_No 18.826 | 0.160 0.809 0.042 17.075 |
## Computer_Yes -18.826 | -0.264 1.332 0.042 -17.075 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Reading | 0.204 0.109 0.000 |
## Cinema | 0.410 0.005 0.087 |
## Show | 0.363 0.014 0.004 |
## Exhibition | 0.374 0.036 0.015 |
## Computer | 0.357 0.051 0.042 |
## Sport | 0.321 0.028 0.018 |
## Walking | 0.161 0.029 0.054 |
## Travelling | 0.357 0.010 0.018 |
## Collecting | 0.028 0.000 0.012 |
## Volunteering | 0.101 0.011 0.046 |
The eigenvalues show that the first dimension retains about 10.5% of the total variance and the second one about 6.6%, so the first two dimensions together account for roughly 17%.
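The percentages in the eigenvalue table can be checked by hand. In indicator-matrix MCA the total inertia equals J/Q - 1, where Q is the number of active variables and J the total number of categories, and the number of dimensions is J - Q. The sketch below assumes that all 19 columns were active (the call `MCA(hobbies, graph = FALSE)` declares no supplementary variables) and uses the rounded eigenvalues from the printout, so the results differ slightly in the last digits.

```r
# Rounded eigenvalues of the first three dimensions, copied from summary(mca)
eig <- c(Dim.1 = 0.183, Dim.2 = 0.115, Dim.3 = 0.104)

Q      <- 19   # active variables (assumption: every column of hobbies is active)
n_dims <- 33   # number of dimensions reported, i.e. J - Q

# Total inertia J/Q - 1 = (J - Q)/Q
total_inertia <- n_dims / Q

# Percentage of variance retained by each dimension
pct <- 100 * eig / total_inertia
round(pct, 1)
```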
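None of the squared correlation ratios (eta2) between the variables and the dimensions is close to 1, so no single variable is strongly tied to a dimension. Among the ten variables printed above, cinema (0.410), exhibition (0.374) and show (0.363) are the most strongly related to dimension 1, followed by computer and travelling (0.357 each) and sport (0.321). Knitting and fishing, which are not included in the truncated printout, seem to be the most strongly related to dimension 2.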
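As a reminder of what eta2 measures, the following sketch computes the squared correlation ratio between a categorical variable and the coordinates on one dimension as between-group variance divided by total variance. The function `eta2()` and the toy scores are hypothetical; this is the unweighted textbook definition, not code taken from FactoMineR.

```r
# Squared correlation ratio: share of the variance of `score`
# explained by the grouping variable `group`
eta2 <- function(score, group) {
  group <- factor(group)
  grand <- mean(score)
  n_k <- tapply(score, group, length)   # group sizes
  m_k <- tapply(score, group, mean)     # group means
  between <- sum(n_k * (m_k - grand)^2)
  total   <- sum((score - grand)^2)
  between / total
}

# Invented coordinates of six individuals on one dimension
score <- c(0.9, 0.7, 0.8, -0.6, -0.8, -1.0)

# A variable that separates the individuals well along the dimension
strong <- eta2(score, c("Yes", "Yes", "Yes", "No", "No", "No"))

# A variable that is almost unrelated to the dimension
weak <- eta2(score, c("Yes", "No", "Yes", "No", "Yes", "No"))

c(strong = strong, weak = weak)
```

A value near 1 means the categories of the variable are well separated along that dimension; values around 0.3 to 0.4, as in the table above, indicate only a moderate link.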
The following plots illustrate the dominant dimensions of the data and a more detailed view of the variables which are related to the two dimensions.
library("factoextra")
res.mca <- MCA(hobbies, graph = FALSE)
eig.val <- get_eigenvalue(res.mca)
# head(eig.val)
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
fviz_mca_var(res.mca, choice = "mca.cor",
repel = TRUE,
ggtheme = theme_minimal())
The following plot is the factor map of the categories.
Here, most of the categories are concentrated near the origin. Categories such as being a widower, knitting and old age lie far from the centre and close to the dimension 1 axis. Being a technician or a manual labourer, fishing, and being a young male are categories contributing to dimension 2.
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
Here, each hobby is represented separately as a plot - pink meaning yes and blue meaning no.
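# confidence ellipses for variables 1-5 (Reading, Cinema, Show, Exhibition, Computer)
plotellipses(mca, keepvar = 1:5, means = FALSE, label = "quali")
# confidence ellipses for variables 6-10 (Sport, Walking, Travelling, Collecting, Volunteering)
plotellipses(mca, keepvar = 6:10, means = FALSE, label = "quali")
# confidence ellipses for variables 11-15 (Mechanic, Gardening, Knitting, Cooking, Fishing)
plotellipses(mca, keepvar = 11:15, means = FALSE, label = "quali")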
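In these plots, the "No" category of each hobby lies on the left-hand side (negative values on dimension 1), while the "Yes" category lies on the right-hand side (positive values). The next four plots show the variables describing the respondents.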
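The left/right pattern can also be read off the category coordinates printed by `summary(mca)`. The short check below uses the ten Dim.1 coordinates copied from that printout; it covers only those five hobbies, and the vector name `dim1` is my own.

```r
# Dim.1 coordinates of the first ten categories, copied from summary(mca)
dim1 <- c(Reading_No = -0.646, Reading_Yes = 0.315,
          Cinema_No = -0.524, Cinema_Yes = 0.782,
          Show_No = -0.385, Show_Yes = 0.942,
          Exhibition_No = -0.412, Exhibition_Yes = 0.907,
          Computer_No = -0.465, Computer_Yes = 0.766)

# Which categories are "Yes" answers?
is_yes <- grepl("_Yes$", names(dim1))

# TRUE if every "Yes" is on the positive side and every "No" on the negative side
yes_right <- all(dim1[is_yes] > 0)
no_left   <- all(dim1[!is_yes] < 0)
c(yes_right = yes_right, no_left = no_left)
```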
# confidence ellipses for the respondent variables: Sex (16), Age (17),
# Marital.status (18) and Profession (19)
plotellipses(mca, keepvar = 16, means = FALSE, label = "quali")
plotellipses(mca, keepvar = 17, means = FALSE, label = "quali")
plotellipses(mca, keepvar = 18, means = FALSE, label = "quali")
plotellipses(mca, keepvar = 19, means = FALSE, label = "quali")
## Conclusions and discussion